
Poster: Phoenix_Sandman Date: Mar 17, 2008 7:58am
Forum: web Subject: Robots.txt Policy is a Failure!

The current Robots.txt policy is UNJUST! The failure lies in obeying the robots.txt instructions of the ->current<- owner of the website, which is often someone like "Network Solutions" who had nothing to do with the previous owner of the site during the period we are interested in, and does NOT know what that previous owner's intentions are, or were, as far as archiving goes.

This is ridiculous, because too many sites have been unjustly shut down or driven out of business, and then some new owner of the domain name haphazardly posts a restrictive robots.txt which blocks access to content that the original creators would often have preferred to keep available!

It is obvious that in most cases, following the robots.txt of the domain's owner AT THE TIME OF EACH ARCHIVING of a site would serve everyone's interests much better!

Say a domain name has changed hands four times, and both the second and the current (fourth) owners did not want their sites archived and had a robots.txt with that instruction: then do not archive them during those periods.

However, the first and third owners did want their sites archived and shared, so their content should still be made available! There can be no argument from current owners, because the consent to archive covers a period during which the current owners had no controlling interest in the site. Just have the Wayback Machine check the ARCHIVED robots.txt for any requested period first, to see whether it was blocked AT THAT TIME or not.

If a site's ->current<- owner changes its robots.txt, and ownership has not changed, make it retroactive to the time at which they took ownership. Easy!
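Just to make the rule concrete, here is a rough sketch in Python. Every name in it (snapshot_is_viewable, lookup_archived_robots, current_owner_since) is made up for illustration, since none of us know the Wayback Machine's internals; treat it as a sketch of the proposed rule, not a real implementation.

    import urllib.robotparser

    def snapshot_is_viewable(url, capture_time, current_owner_since,
                             live_robots_txt, lookup_archived_robots):
        # Decide which robots.txt gets to speak: the live one for captures
        # made under the current owner, the archived one for older captures.
        if capture_time >= current_owner_since:
            # The current owner's wishes apply retroactively, but only back
            # to the date they took ownership (as proposed above).
            robots_txt = live_robots_txt
        else:
            # Older captures are judged by the robots.txt that was archived
            # around the time the page was crawled (hypothetical helper).
            robots_txt = lookup_archived_robots(url, capture_time)

        parser = urllib.robotparser.RobotFileParser()
        parser.parse(robots_txt.splitlines())
        # "ia_archiver" is the user agent name commonly used for the
        # Archive's crawler in robots.txt rules.
        return parser.can_fetch("ia_archiver", url)

If that returns True for the capture someone asked for, show it; if not, hide only the captures from that owner's period.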

As it is, way too many sites are being blocked for no good reason at all. If you don't think we are right (and we have over 14 years' experience in the computer business), then just run a poll and ask the members yourself. Then you'll see how widespread this problem is. We have too much censorship already!

Thank you for reading this.

Poster: DannyDaemonic Date: Apr 27, 2008 12:59pm
Forum: web Subject: Re: Robots.txt Policy is a Failure!

Hey that's exactly what I was thinking. This isn't fair at all. The problem with your new approach is that sometimes people will want their previous pages taken down, due to some embarrassment or vandalism.

Probably the best solution to this would be, as you suggested, to never retroactively remove pages because of a robots.txt, but also to copy Google's site verification (where you host a page with a specific name), and once you're verified, to let you delete any or all of the archived dates. This prevents owners from destroying their history with a bad robots.txt, and lets people remove just the vandalized date rather than everything at once.
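To sketch what that could look like in Python (every name here, issue_challenge, is_verified, remove_snapshot, is invented for illustration; the Archive has no such API that I know of):

    import secrets
    import urllib.request

    def issue_challenge(domain):
        # Hand the claimed owner a random token to host at a well-known
        # filename, much like Google's site-verification file.
        token = secrets.token_hex(16)
        return "archive-verify-%s.txt" % token, token

    def is_verified(domain, filename, token):
        # Ownership is proven by serving the token back from the live site.
        try:
            with urllib.request.urlopen("http://%s/%s" % (domain, filename)) as resp:
                return resp.read().decode().strip() == token
        except OSError:
            return False

    def remove_dates(domain, dates, verified, remove_snapshot):
        # A verified owner removes exactly the captures they name (say, the
        # one vandalized date) instead of the site's whole history.
        if not verified:
            raise PermissionError("domain ownership not verified")
        for date in dates:
            remove_snapshot(domain, date)  # hypothetical storage call

The point is that deletion becomes an explicit, authenticated request for specific dates instead of a side effect of whatever the live robots.txt happens to say.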

Poster: mucizeurunler Date: Jul 20, 2008 1:19pm
Forum: web Subject: Re: Robots.txt Policy is a Failure!

The robots.txt file is very important for Googlebot.

[example website]

This post was modified by Detective John Carter of Mars on 2008-07-20 20:19:34

Poster: DannyDaemonic Date: Jul 3, 2008 2:37pm
Forum: web Subject: Re: Robots.txt Policy is a Failure!

Apparently you don't know what the Internet Archive's robots.txt policy is. This has nothing to do with Google or search engines. No one here is trying to stop robots.txt from doing what it's supposed to. We are simply unhappy with the way the Internet Archive uses robots.txt. If someone tells robots not to index a page, the Internet Archive will delete that page from its archive -- not just that version of the page, but ALL previous versions of it. And it's gone forever, so if you're a webmaster and you block all these search bots because you're under heavy load, you'll accidentally and permanently erase all Internet Archive copies of your website. Also, sites sometimes change hands; a new owner may not want their pages backed up by the Internet Archive, but in order to stop that they also have to erase all previous archives.

This policy is what we're calling a failure. A robots.txt can delete all previous archives of a page, there's no fine-grained control, and it's too easy for a website to accidentally erase its entire history.
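For anyone who hasn't seen it happen, the robots.txt that triggers this is nothing exotic. Here's the usual block-everything file run through Python's standard robotparser, with "ia_archiver" standing in for the Archive's crawler name (the script is just an illustration, not anything the Archive runs):

    import urllib.robotparser

    # What a webmaster under heavy load (or a domain parker) typically posts:
    #   User-agent: *
    #   Disallow: /
    robots_txt = "User-agent: *\nDisallow: /\n"

    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())

    # The wildcard rule covers the Archive's crawler along with everyone else:
    print(parser.can_fetch("ia_archiver", "http://example.com/any/page"))  # False

Under the current policy that one live file doesn't just stop new crawls; it also pulls every previously archived copy of the site out of the Wayback Machine, which is exactly the accidental erasure described above.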

Poster: Robin_1990 Date: May 7, 2008 8:25am
Forum: web Subject: Robots.txt Policy makes little baby Gohan cry!

It's true.

Poster: MSRX92 Date: May 7, 2008 1:20pm
Forum: web Subject: Re: Robots.txt Policy makes little baby Gohan cry!

Surely there is a different way to do this. Like, if I take a picture of someone's house, and the house is sold, the new owner can't demand that I no longer show anyone a picture of that house.

Poster: DannyDaemonic Date: May 8, 2008 12:38am
Forum: web Subject: Re: Robots.txt Policy makes little baby Gohan cry!

This is the best idea I can come up with, and it's certainly better than what we currently have. Some might argue it's completely fair. I'm not one of those people, but I could see this making a certain group of people 100% happy. Perhaps even the lawyers would OK this one.

The major problem my solution solves is people putting up robots.txt files that deny all robots in an attempt to save on bandwidth while they decide what to do with their website, or squat on it, or resell it, or whatever. The accidental side effect is that they delete their archive history for the whole site.

Knowing this has been stopped should comfort you, at least a little, since there are no more accidental erasures (and most people don't even know about the Internet Archive). Of course you are always free to archive your own website on CD or DVD or whatever medium you choose. That will have to be your picture for now.